This RMD serves as a guide/lab journal documenting my methods for processing and modelling FT-NIRS otolith spectra from juvenile Pacific cod (Gadus macrocephalus) in the Gulf of Alaska. Scans were acquired on a Bruker MPA II with an integrating sphere, a 5 mm Teflon disk taped over the scanning window, and a gold transflectance stamp placed over the top of samples. Spectra were collected between 11,500 and 4,000 cm-1 at a resolution of 16 cm-1 with 64 replicate scans.

LPW scans refer to specimens collected near Little Port Walter (LPW), a NOAA research station near the southern tip of Baranof Island in Southeast Alaska. Juvenile cod were collected using beach and purse seines from embayments near LPW on July 27-28, 2020, then transferred into outdoor net pens at LPW to be reared in captivity. Specimens were sampled approximately weekly (ADD MORE IN HERE LATER)

Packages

This segment is adapted from code provided by Dr. Esther Goldstein at NOAA’s AFSC in Seattle. (ADD INFO ON PACKAGES INCLUDED?)

# Install packages not yet installed
packages <- c("caret", "corrplot", "devtools", "dplyr", "gglm", "ggplot2", "gratia", "gridExtra", "knitr", "lubridate", "mdatools", "mgcv", "MuMIn", "opusreader2", "prospectr", "splitstackshape","shiny", "tidyr", "viridis")
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  utils::install.packages(pkgs = packages[!installed_packages])
}
invisible(lapply(packages, library, character.only = TRUE)) # load all packages in list
# install packages not on CRAN
if (!require("remotes")) install.packages("remotes")
if (!require("opusreader2")) {
  remotes::install_github("spectral-cockpit/opusreader2")
  library("opusreader2")
}
# if (!require("simplerspec")) {
#   remotes::install_github("philipp-baumann/simplerspec")
#   library("simplerspec")
# }
rm(installed_packages, packages) # remove objects from environment

Variables - Metadata

Some notes on metadata variables

  • Specimen # in dataset (specimen)
  • Fish fork length (to nearest whole mm) (length)
  • Fish whole body weight (to nearest whole gram), may have been taken when partially frozen (weight)
  • Weight of otolith (nearest 0.0001 gram) used for FT-NIRS scan (structure_weight)
  • otolith read age (read_age) in days representing an average of multiple reads (agreement <= 5%)
  • side of otolith scanned (side)
  • % of otolith with some sort of abnormality, including crystallization or broken/chipped portions (percent_affected)
  • other problems include tissue stuck to otoliths (other_problem)
  • whether an otolith was broken or chipped at all (broken)
  • scan name/session or run number (scan_name, run_number) to indicate separate scans for the same specimen (LPW specimens scanned in triplicate)
  • Filename of .0 opus files for each specimen & scan session (file_name)
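As a quick, hypothetical illustration of the read-age agreement criterion above (the 5% threshold comes from the list; the read columns, percent-difference formula, and averaging rule here are assumptions for illustration, not the actual ageing protocol):

```r
# Hypothetical sketch of applying an "agreement <= 5%" rule to replicate reads;
# column names and the exact rule are assumptions, not the real workflow.
read_1 <- c(120, 98, 143)  # first age read (days), made-up values
read_2 <- c(124, 101, 139) # second age read (days), made-up values
pct_diff <- abs(read_1 - read_2) / ((read_1 + read_2) / 2) * 100 # percent difference between reads
read_age <- ifelse(pct_diff <= 5, round((read_1 + read_2) / 2), NA) # average only reads that agree
```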

Importing FT-NIRS Spectra

The Bruker MPA outputs FT-NIRS spectra in a .0 format. Prior to loading spectral data, a metadata .csv (LPW_metadata.csv) was generated to pair each specimen's biological data with its FT-NIRS scans.

To simplify loading the .0 files, all .0 file names should be added to the metadata file. All filenames in a folder can be generated with list.files(). For pulling file names to create a new .csv, you can use:
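A minimal sketch of that step (the folder path is an assumed example; adjust it to wherever the .0 files live):

```r
# List every .0 file in the scan folder; the path here is an assumed example
fnames <- list.files(path = "FT-NIRS_spectra/LPW_scans", pattern = "\\.0$")
print(fnames)
# Alternatively, write the column straight to a csv and skip the Excel steps:
# write.csv(data.frame(file_name = fnames), "file_names.csv", row.names = FALSE)
```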

This output (without index numbers in square brackets) can be copied and pasted into Excel, then converted into columns using the “Data -> Text to Columns” function and selecting a “space” delimiter. These columns can then be transposed into a file_name column.

Metadata

First I import my metadata (LPW_metadata.csv) and convert session_title (i.e. scan session) to a factor. LPW FT-NIRS scans were repeated over time (NIR_LPW202001202_otoliths, ...otoliths_rescan_1, ...otoliths_rescan_2, ...otoliths_rescan_3) to see whether spectra changed after storage. The LPW metadata also contained additional entries with notes and comments; only rows from actual scan sessions are imported.

# 66 missing from first 2 scans, 3 & 4 have it though (removed from dataset for now, may incorporate back in)
meta_LPW <- read.csv("metadata/LPW_metadata.csv") # import metadata
levels(as.factor(as.character(meta_LPW$session_title))) # list the scan session titles present in the metadata (session_title is treated as a factor; sessions were scanned at separate times)
## [1] "NIR_LPW202001202_otoliths"          "NIR_LPW202001202_otoliths_rescan_1"
## [3] "NIR_LPW202001202_otoliths_rescan_2" "NIR_LPW202001202_otoliths_rescan_3"
meta_LPW <- meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths", "NIR_LPW202001202_otoliths_rescan_1", "NIR_LPW202001202_otoliths_rescan_2", "NIR_LPW202001202_otoliths_rescan_3"), ]
meta_LPW$specimen[meta_LPW$file_name == ""] # check for missing file_names in metadata
## integer(0)
meta_LPW <- meta_LPW[meta_LPW$file_name != "", ] # remove any rows that are missing a file_name in the metadata
# check number of specimens for each scan session (122 specimens total; 121 per session after excluding specimen 66)
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths"), ])
## [1] 121
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths_rescan_1"), ])
## [1] 121
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths_rescan_2"), ])
## [1] 121
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths_rescan_3"), ])
## [1] 121
meta_LPW$sample_date <- lubridate::mdy(meta_LPW$sample_date) # parse sample_date (mdy() already returns a Date)
# generate file paths to import with Opusreader
meta_LPW$file_path <- paste0(getwd(), "/FT-NIRS_spectra/LPW_scans/", meta_LPW$file_name) # generate filepaths for all .0 files
Opusfiles <- as.vector(meta_LPW$file_path) # store filepaths in object
exists <- file.exists(Opusfiles) # file.exists() is vectorised; flags any missing files before reading so read_opus() doesn't error
meta_LPW$exists <- exists # add column to metadata dataframe to confirm file exists in directory
meta_LPW1 <- meta_LPW[meta_LPW$exists, ] # filter the file list and data by otoliths with spectra files
Opusfiles <- as.vector(meta_LPW1$file_path) # from Esther, "I repeated this and wrote over it so I wouldn't have extra files to read in that don't exist and produce an error"
rm(exists)
rm(meta_LPW1)

Opus files

Once all metadata has been imported into a dataframe with filepaths for .0 files, FT-NIRS scans can be imported using the opusreader2 package and read_opus(). First a single file is read-in to make sure the script works and everything looks good; then all .0 files are imported using the Opusfiles object. read_opus() creates a massive list of lists that includes information about MPA II settings, metadata generated during scans, specimen names, etc. See ?read_opus() from the opusreader2 package for more details on the list structure of .0 files.

# read a single file (one measurement) to check that script runs without failure
file <- Opusfiles[1]
data_list <- opusreader2::read_opus(dsn = file) # NAs may be introduced due to a timestamp metadata issue; this is normal, and suppressWarnings() is used below to quiet the output
rm(data_list)
rm(file)
SPCfiles_nooffset <- suppressWarnings(lapply(Opusfiles, opusreader2::read_opus)) # Read in all files in the Opusfiles list.  This gives an error if any file names or paths are wrong.

# uncomment to see additional output from .0 files

# str(SPCfiles_nooffset[[1]]) # check first element
# SPCfiles_nooffset[[1]][[1]]$ab$data # can see spc values this way I think
# SPCfiles_nooffset[[1]][[1]]$lab_and_process_param_raw$parameters #this has info about what setting was used (here otolith), species, and file name
# SPCfiles_nooffset[[1]][[1]]$lab_and_process_param_raw$parameters$FC2$parameter_value # species
# SPCfiles_nooffset[[1]][[1]]$lab_and_process_param_raw$parameters$FD1$parameter_value # unique ID, including project title (LPW2020), experimental designation (01 for experimental, mortalities are numbered 02), species code (202), specimen number (_1_) and scan number (OA1)
# SPCfiles_nooffset[[1]][[1]]$ab$wavenumbers # wavenumbers scanned
# SPCfiles_nooffset[[1]][[1]]$instrument_ref$parameters$INS$parameter_value # instrument name

Once all .0 files with included metadata have been successfully read-in, FT-NIRS scan data can be extracted from the lists.

# extract absorbance measurements, wavenumbers scanned, unique specimen ID, species  & instrument name from .0 files list
spectra <- lapply(SPCfiles_nooffset, function(x) x[[1]]$ab$data) # FT-NIRS scan data, absorbance measurements at all 949 wavenumbers
wavenumber <- lapply(SPCfiles_nooffset, function(x) x[[1]]$ab$wavenumbers) # list of all wavenumbers scanned
file_id <- lapply(SPCfiles_nooffset, function(x) x[[1]]$lab_and_process_param_raw$parameters$FD1$parameter_value) # unique ID in Opus (note the [[1]], matching the list structure explored above)
species <- lapply(SPCfiles_nooffset, function(x) x[[1]]$lab_and_process_param_raw$parameters$FC2$parameter_value) # species
instrument <- lapply(SPCfiles_nooffset, function(x) x[[1]]$instrument_ref$parameters$INS$parameter_value) # instrument name
# create dataframe out of list
spectra <- lapply(spectra, as.data.frame)
# add metadata to dataframe (one loop instead of five)
for (i in seq_along(spectra)) {
  colnames(spectra[[i]]) <- wavenumber[[i]] # assign wavenumbers to columns of absorbance measurements
  spectra[[i]]$species <- species[[i]]
  spectra[[i]]$file_id <- file_id[[i]]
  spectra[[i]]$instrument <- instrument[[i]]
  spectra[[i]]$file_path <- Opusfiles[[i]]
}
# extract file_name from file_path: test code on one file first
# try <- spectra[[1]] # specimen 1, scan session 1
# splitstackshape::cSplit(as.data.frame(try$file_path), sep = "/", splitCols = "try$file_path", type.convert = FALSE) %>% select(tail(names(.), 1))
# rm(try)
# create file_name variable for all scans
file_name <- lapply(spectra, function(x) splitstackshape::cSplit(as.data.frame(x$file_path), sep = "/", splitCols = "x$file_path", type.convert = FALSE) %>% select(tail(names(.), 1)))
# file_name[[1]][[1, 1]] # check that first element is correct
# add file_name variable to spectra dataframe
for (i in 1:length(spectra)) {
  spectra[[i]]$file_name <- file_name[[i]][[1, 1]]
}
rm(file_id, file_name, instrument, species, wavenumber, Opusfiles, i, SPCfiles_nooffset)

Combine dataframes

df <- as.data.frame(do.call(rbind, spectra)) # convert spectral list/dataframe to dataframe object
dfmeta_LPW <- dplyr::left_join(meta_LPW, df, by = c("file_name", "file_path")) # join metadata to spectral data
rm(df)
dfmeta_LPW <- dfmeta_LPW %>% select(-c("exists", "instrument")) # remove exists and instrument columns
dfmeta_LPW$run_number <- as.factor(dfmeta_LPW$run_number)
rm(meta_LPW, spectra)
colnames(dfmeta_LPW) <- as.character(colnames(dfmeta_LPW)) # make sure column names are characters, not numeric
dfmeta_longLPW <- pivot_longer(dfmeta_LPW, cols = -c(1:20)) # pivot absorbance measurements to long format for easy plotting
dfmeta_longLPW <- dfmeta_longLPW %>% rename(., "wavenumber" = "name") # rename name column to wavenumber for clarification
dfmeta_longLPW$wavenumber <- as.numeric(as.character(dfmeta_longLPW$wavenumber)) # change class of wavenumber variable to a numeric
# Plot 7 specimens to see if data loaded properly.

Plot check

ggplot() +
  geom_path(
    data = dfmeta_longLPW[dfmeta_longLPW$specimen %in% c(1, 10, 100, 101, 102, 103, 104), ],
    aes(x = wavenumber, y = value, color = session_title, group = file_name), linewidth = .5
  ) +
  scale_x_reverse() +
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1))) +
  theme(
    axis.text = element_text(size = 10),
    axis.text.x = element_text(size = 12, angle = 25),
    axis.title = element_text(size = 12),
    legend.position = "none",
    strip.text = element_text(size = 14),
    legend.text = element_text(size = 10),
    legend.title = element_text(size = 12),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black")
  ) +
  facet_wrap(~specimen)

# Save RDS for data transparency and easily enable others to work with dataframe without all these complicated import steps
# saveRDS(dfmeta_LPW, "RDS_dataframes/LPW_dfmeta.RDS")

Exploratory Analysis

Functions

Some functions useful for exploring the data, including quick spectral plotting, PCA, and visualising the impact of different preprocessing filters.

# ggplot function to plot spectra, colored by some factor
spec.fig <- function(mydf, color) {
  # color <- as.character({{color}})
  ggplot(mydf) +
    geom_path(aes(x = name, y = value, color = {{ color }}, group = file_name)) +
    scale_x_reverse() +
    scale_color_viridis() +
    labs(y = "Preprocessed absorbance", x = expression(paste("Wavenumber ", cm^-1)))
  # theme(axis.text = element_text(size = 16),
  # axis.text.x = element_text(size = 12, angle = 25),
  # axis.title = element_text(size = 14),
  # legend.position = "right",
  # strip.text = element_text(size = 14),
  # legend.text = element_text(size = 14),
  # legend.title = element_text(size = 14),
  # panel.grid.major = element_blank(),
  # panel.grid.minor = element_blank(),
  # panel.background = element_blank(),
  # axis.line = element_line(colour = "black")
  # )
}
# function to plot results of different savitzkyGolay filter params
sg_plotting <- function(color, m, p, w) {
  dftempproc <- as.data.frame(
    cbind(
      scan_avg[, c(1:20)],
      savitzkyGolay(scan_avg[, 21:length(scan_avg)], m = m, p = p, w = w)
      )
    )
  dftempproc_long <- tidyr::pivot_longer(dftempproc, cols = c(21:length(dftempproc)))
  dftempproc_long$name <- as.numeric(as.character(dftempproc_long$name))
  spec.fig(mydf = dftempproc_long, color = {{ color }}) +
    ggtitle(paste("diff = ", {{ m }}, "poly = ", {{ p }}, "window = ", {{ w }}))
}
# function to apply savitzkyGolay filter with selected parameters to dataframe and create new temp_proc and temp_proc_long dataframes for modelling
quickproc <- function(df, m, p, w) {
  temp_proc <<- as.data.frame(
    cbind(
      {{ df }}[, c(1:20)],
      savitzkyGolay({{ df }}[, 21:length({{ df }})], m = m, p = p, w = w)
    )
  )
  temp_proc_long <<- tidyr::pivot_longer(temp_proc, cols = c(21:length(temp_proc)))
  temp_proc_long$name <<- as.numeric(as.character(temp_proc_long$name))
}

Preliminary spectra and PCA plots were explored to see whether spectra from the different scan sessions were similar enough to warrant combining for analyses. Previous research has shown FT-NIRS spectra can change considerably over time, and the replicate scans of LPW otoliths were collected over nearly a year between the first and last sessions.

Spectra - Scans 1-4

# Plot FT-NIRS scans, all 4 runs, colored by run_number
ggplot(dfmeta_longLPW) +
  geom_path(aes(x = wavenumber, y = value,
                color = run_number, group = file_name), 
            alpha = 0.5) +
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
  scale_color_viridis(discrete = T) +
  scale_x_reverse() + # FT-NIRS spectra often displayed with reversed x-axis for otoliths - lower wavenumbers generally more informative and are more easily seen away from axes
  ggtitle("LPW FT-NIRS Spectra, All Scan Sessions")

Absorbance values appear to differ slightly between scan sessions, so I first look at pairings of scans.

Spectra - Scan pairings

plot1 <- ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(1, 2), ]) +
  geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) +
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
  scale_color_viridis(discrete = T) +
  scale_x_reverse() +
  ggtitle("LPW FT-NIRS Spectra, Scans 1 & 2")
# Scans 1 & 3
plot2 <- ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(1, 3), ]) +
  geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) +
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
  scale_color_viridis(discrete = T) +
  scale_x_reverse() +
  ggtitle("LPW FT-NIRS Spectra, Scans 1 & 3")
# Scans 2 & 3
plot3 <- ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(2, 3), ]) +
  geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) +
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
  scale_color_viridis(discrete = T) +
  scale_x_reverse() +
  ggtitle("LPW FT-NIRS Spectra, Scans 2 & 3")
# Scans 3 & 4
plot4 <- ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(3, 4), ]) +
  geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) +
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
  scale_color_viridis(discrete = T) +
  scale_x_reverse() +
  ggtitle("LPW FT-NIRS Spectra, Scans 3 & 4")
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)

rm(plot1, plot2, plot3, plot4)

Scan 1 looks quite different from scans 2, 3 & 4. These should be replicate scans with very similar values for each of the 121 specimens. To confirm, I extract and plot the first two principal components across all FT-NIRS wavenumbers (indices 21:ncol(dfmeta_LPW)) and color points by run number.

PCA - Scans 1-4

pca_all_LPW <- pca(dfmeta_LPW[21:ncol(dfmeta_LPW)], scale = T)
pcs <- as.data.frame(cbind(pc2 = pca_all_LPW$calres$scores[, 2], run_number = dfmeta_LPW$run_number)) # extract scores for PC2
pcs <- cbind(pc1 = pca_all_LPW$calres$scores[, 1], pcs) # extract scores for PC1
# plot PC1 & PC2, colored by run_number
ggplot(pcs) +
  geom_point(aes(x = pc1, y = pc2, color = as.factor(run_number)), size = 3) +
  labs(x = paste("PC1 (",
                 round(pca_all_LPW$calres$expvar[1], digits = 3), # variance explained by PC1
                 "% var. explained )"),
       y = paste("PC2 (", 
                 round(pca_all_LPW$calres$expvar[2], digits = 3), # variance explained by PC2
                 "% var. explained )"),
       color = "Run Number"
  ) +
  scale_color_viridis(discrete = T)

rm(pca_all_LPW, pcs, dfmeta_longLPW)

Scan #1 is confirmed to be quite different. Something appears to have changed between scan periods, though exactly what is unknown; approximately 10 months elapsed between scans 1 and 2. All subsequent analyses will average the absorbance measurements from scans 2, 3 & 4.

Average Scans 2-4

Combining/averaging LPW FT-NIRS scans 2, 3 & 4, then preprocessing and filtering spectra >7,500 cm-1. Scan 1 is excluded based on the PCA above.

Spectral preprocessing serves to remove unwanted noise from absorbance measurements and filter out NIR backscatter, and it can improve model fits by accentuating between-specimen differences in absorbance at informative wavenumbers. Initial preprocessing is done with prospectr and a Savitzky-Golay filter. I will stick with one method and explore others later if time allows.
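To make the filter parameters concrete, here is a small base-R sketch of how a Savitzky-Golay derivative filter works. prospectr::savitzkyGolay is what is actually used on the spectra; this toy version only illustrates the local polynomial fit behind its m (derivative order), p (polynomial degree), and w (window size) arguments.

```r
# Toy Savitzky-Golay coefficients: fit a degree-p polynomial over a w-point
# window and return the weights for the m-th derivative at the window centre.
sg_coef <- function(m, p, w) {
  half <- (w - 1) / 2
  x <- -half:half
  A <- outer(x, 0:p, `^`) # polynomial design matrix over the window
  # row (m + 1) of the least-squares pseudoinverse gives the m-th derivative weights
  solve(t(A) %*% A, t(A))[m + 1, ] * factorial(m)
}
w1 <- sg_coef(m = 1, p = 3, w = 17) # the settings applied to the spectra below
# Sanity check: applying the weights to a straight line recovers its slope
# (here 1) at every interior point, as a first-derivative filter should.
deriv <- stats::filter(1:100, rev(w1), sides = 2)
```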

Removing the noisy stretch of spectra above 7,500 cm-1; a similar approach was used by Matta et al. in their walleye pollock daily-age paper and appears typical for otolith FT-NIRS work.

scan_2 <- dfmeta_LPW %>% dplyr::filter(run_number == 2)
scan_3 <- dfmeta_LPW %>% dplyr::filter(run_number == 3)
scan_4 <- dfmeta_LPW %>% dplyr::filter(run_number == 4)
scan_avg <- scan_2[, 1:20] # carry over metadata columns (rows are aligned across scans by specimen order)
scan_avg <- bind_cols(scan_avg, (scan_2[, 21:ncol(scan_2)] + scan_3[, 21:ncol(scan_3)] + scan_4[, 21:ncol(scan_4)]) / 3) # element-wise mean of scans 2-4
rm(scan_2, scan_3, scan_4, dfmeta_LPW)
# saveRDS(scan_avg, "RDS_dataframes/LPW_scan_avg_unproc.RDS")
# preprocess
scan_avg_filter <- cbind(scan_avg[, c(1:20)], savitzkyGolay(scan_avg[, 21:length(scan_avg)], m = 1, p = 3, w = 17))
scan_avg_filter <- scan_avg_filter[, -c(21:517)] # removing the noisy stretch of spectra (>7500, up to 7504)
rm(scan_avg)
# saveRDS(scan_avg_filter, "RDS_dataframes/LPW_scan_avg_proc_filter.RDS")

RDS files

# dfmeta_LPW <- readRDS("RDS_dataframes/LPW_dfmeta.RDS")
# scan_avg <- readRDS("RDS_dataframes/LPW_scan_avg_unproc.RDS")
# scan_avg_filter <- readRDS("RDS_dataframes/LPW_scan_avg_proc_filter.RDS")

Model Age - FT-NIRS

MLRs & GAMs

PCA

Look for outliers in the first two PCs, colored by age.

scan_avg_filter <- scan_avg_filter[complete.cases(scan_avg_filter$read_age), ] # filter only aged specimens
pca_temp <- pca(scan_avg_filter[, 21:ncol(scan_avg_filter)])
scan_avg_filter$PC1 <- pca_temp$res$cal$scores[, 1]
scan_avg_filter$PC2 <- pca_temp$res$cal$scores[, 2]
rm(pca_temp)
rm_specimens <- c("24", "52", "53", "59", "65", "67", "74")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(
  ggplot() +
    geom_point(data = scan_avg_filter[!c(scan_avg_filter$specimen %in% rm_specimens), ],
               aes(x = PC1, y = PC2, color = read_age), size = 5
    ) +
    geom_point(data = scan_avg_filter[c(scan_avg_filter$specimen %in% rm_specimens), ],
               aes(x = PC1, y = PC2, color = read_age), shape = 17, size = 7
    ) +
    scale_color_viridis()
)
rm(rm_specimens)

Removing the outliers identified above

scan_avg_filter <- scan_avg_filter %>%
  dplyr::filter(read_age != 188, read_age != 181, read_age != 175, read_age != 161, read_age != 147, read_age != 135, specimen != 65)
# store metrics from each fold and each model type
RMSE.age <- list()
r2.age <- list()
AIC.age <- list()
AICc.age <- list()
mod.summary <- data.frame(
  model = c("lm", "gam", "pls","pls_vip"),
  r2 = 1:4,
  RMSE = 1:4,
  AIC = c(0,0,NA,NA),
  AICc = c(0,0,NA,NA))
# 10 fold CV split - create folds
set.seed(6)
splits <- caret::createFolds(scan_avg_filter$read_age, k = 10, list = TRUE, returnTrain = FALSE)
# extract PC's for each calibration set, create test sets with ages and spectra
cal <- list()
test <- list()
for (i in 1:10) {
  # calibration set and PC's
  pc.mod <- preProcess(scan_avg_filter[-splits[[i]], -c(1:20)], method = "pca", thresh = 0.95, pcaComp = 10)
  pc.cal <- predict(pc.mod, scan_avg_filter[-splits[[i]], -c(1:20)])
  pc.cal <- cbind(pc.cal, scan_avg_filter[-splits[[i]], ])
  cal[[i]] <- pc.cal
  rm(pc.cal)
  # test sets
  pc.test <- predict(pc.mod, scan_avg_filter[splits[[i]], -c(1:20)])
  pc.test <- cbind(pc.test, scan_avg_filter[splits[[i]], ])
  test[[i]] <- pc.test
  rm(pc.test, pc.mod)
}
# determine which PC's to include via step & AIC selection
mod.sel <- list()
for (i in 1:10) {
  pctest <- cal[[i]]
  temp <- step(lm(data = pctest, read_age ~ PC1 + PC2 + PC3 + PC4)) # pctest is already the calibration subset, so no further row indexing is needed
  mod.sel[[i]] <- rownames(summary(temp)$coef)
  rm(pctest, temp)
}
knitr::kable(table(unlist(mod.sel)), align = "c")
Var1          Freq
(Intercept)     10
PC1             10
PC2              6
PC3             10
PC4              8
rm(mod.sel)

PC 2 will be excluded for now.

MLR

# lm() with 10-fold CV
mods.lm <- list()
for (i in 1:10) {
  calibrate <- cal[[i]]
  testing <- test[[i]]
  mod <- lm(data = calibrate, read_age ~ PC1 + PC3 + PC4)
  # RMSE.age$lm.cal[i] <- caret::RMSE(pred = mod$fitted.values, obs = calibrate[, "read_age"])
  preds <- predict(mod, newdata = testing)
  RMSE.age$lm.test[i] <- caret::RMSE(pred = preds, obs = testing[, "read_age"])
  # r2.age$lm.cal[i] <- summary(mod)$r.squared
  RSS <- sum((testing$read_age - preds)^2)
  TSS <- sum((testing$read_age - mean(testing$read_age))^2)
  r2.age$lm.test[i] <- 1 - (RSS / TSS)
  AIC.age$lm[i] <- AIC(mod)
  AICc.age$lm[i] <- AICc(mod)
  mods.lm[[i]] <- mod
  rm(mod, preds, RSS, TSS)
}
mod.summary[1,2] <- round(mean(r2.age$lm.test), 3)
mod.summary[1,3] <- round(mean(RMSE.age$lm.test), 3)
mod.summary[1,4] <- round(mean(AIC.age$lm) , 3)
mod.summary[1,5] <- round(mean(AICc.age$lm) , 3)
# RMSE.age$lm.cal <- mean(RMSE.age$lm.cal)
# r2.age$lm.cal <- mean(r2.age$lm.cal)
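The held-out R² computed in the loop above is simply 1 - RSS/TSS evaluated on the test fold; a tiny standalone check with made-up numbers:

```r
# Standalone illustration of the test-fold R^2 used above (made-up values)
obs <- c(100, 120, 140, 160)  # hypothetical observed ages (days)
pred <- c(102, 118, 145, 155) # hypothetical model predictions
RSS <- sum((obs - pred)^2)          # residual sum of squares: 58
TSS <- sum((obs - mean(obs))^2)     # total sum of squares: 2000
r2_test <- 1 - RSS / TSS            # held-out R^2
```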

GAM

# GAM with 10-fold CV. Note: adding select = TRUE to the gam() call would allow smooth terms to be penalized to zero and effectively removed from the model; it is not set in the fit below.
mods.GAM <- list()
for (i in 1:10) {
  calibrate <- cal[[i]]
  testing <- test[[i]]
  mod <- gam(data = calibrate, read_age ~ s(PC1) + s(PC2) + s(PC3) + s(PC4), method = "REML")
  # Extract AIC, AICc, RMSE (cal & test) & r2 (cal & test)
  # RMSE.age$GAM.cal[i] <- caret::RMSE(pred = mod$fitted.values, obs = calibrate[, "read_age"])
  preds <- predict(mod, newdata = testing)
  RMSE.age$GAM.test[i] <- caret::RMSE(pred = preds, obs = testing[, "read_age"])
  # r2.age$gam.cal[i] <- summary(mod)$r.sq
  RSS <- sum((testing$read_age - preds)^2)
  TSS <- sum((testing$read_age - mean(testing$read_age))^2)
  r2.age$gam.test[i] <- 1 - (RSS / TSS)
  AIC.age$gam[i] <- AIC(mod)
  AICc.age$gam[i] <- AICc(mod)
  mods.GAM[[i]] <- mod
  rm(RSS, TSS, preds, i, mod, calibrate, testing)
}
mod.summary[2,2] <- round(mean(r2.age$gam.test), 3)
mod.summary[2,3] <- round(mean(RMSE.age$GAM.test), 3)
mod.summary[2,4] <- round(mean(AIC.age$gam), 3)
mod.summary[2,5] <- round(mean(AICc.age$gam), 3)
rm(AIC.age, AICc.age)

PLS

mods.pls <- list()
mods.vip <- list()
# out.mods <- list()
# pls, no variable selection & VIP > 1
for (i in 1:10) {
  calibrate <- scan_avg_filter[-splits[[i]], ]
  testing <- scan_avg_filter[splits[[i]], ]
  mod <- pls(calibrate[, 31:ncol(calibrate)], calibrate[, "read_age"],
    ncomp.selcrit = "wold", scale = F, center = T, cv = 1,
    info = "Age Prediction Model",
    x.test = testing[, 31:ncol(testing)],
    y.test = testing[, "read_age"]
  )
  ncomp <- mod$ncomp.selected
  RMSE.age$pls.test[i] <- mod$testres$rmse[[ncomp]]
  r2.age$pls.test[i] <- mod$testres$r2[[ncomp]]
  mods.pls[[i]] <- mod
  # r2.age$pls.cal[i] <- mod$calres$r2[[ncomp]]
  # RMSE.age$pls.cal[i] <- mod$calres$rmse[[ncomp]]
        #  # Exclude outliers from above
        #  outliers <- which(categorize(mod, mod$res$cal) == "extreme")
        #  mod.out <- pls(calibrate[-outliers, 31:ncol(calibrate)], calibrate[-outliers, "read_age"],
        #                 ncomp.selcrit = "wold", scale = F, center = T, cv = 1,
        #                 info = "Age Prediction Model",
        #                 x.test = testing[-outliers, 31:ncol(testing)],
        #                 y.test = testing[-outliers, "read_age"]
        #  )
        #  ncomp <- mod.out$ncomp.selected
        # # RMSE.age$pls.cal[i] <- mod$calres$rmse[[ncomp]]
        #  RMSE.age$pls.outliers[i] <- mod.out$testres$rmse[[ncomp]]
        # #  r2.age$pls.cal[i] <- mod$calres$r2[[ncomp]]
        #  r2.age$pls.outliers[i] <- mod.out$testres$r2[[ncomp]]
        #  out.mods[[i]] <- mod.out
        # VIP < 1 excluded
  vip <- as.data.frame(vipscores(mod))
  mod <- pls(calibrate[, 31:ncol(calibrate)], calibrate[, "read_age"],
    ncomp.selcrit = "wold", scale = F, center = T, cv = 1,
    info = "Age Prediction Model",
    x.test = testing[, 31:ncol(testing)],
    y.test = testing[, "read_age"],
    exclcols = vip$V1 < 1
  )
  ncomp <- mod$ncomp.selected
  r2.age$vip.test[i] <- mod$testres$r2[[ncomp]]
  mods.vip[[i]] <- mod
  RMSE.age$vip.test[i] <- mod$testres$rmse[[ncomp]]
  # r2.age$vip.cal[i] <-  mod$calres$r2[[ncomp]]
  # RMSE.age$vip.cal[i] <-  mod$calres$rmse[[ncomp]]
  rm(calibrate, mod, ncomp, testing, i, vip)
}
# RMSE.age$pls.cal <- mean(RMSE.age$pls.cal)
# RMSE.age$pls.outliers <- mean(RMSE.age$pls.outliers)
# r2.age$pls.cal <- mean(r2.age$pls.cal)
# r2.age$pls.outliers <- mean(r2.age$pls.outliers)
# RMSE.age$vip.cal <- mean(RMSE.age$vip.cal)
mod.summary[3, 2] <- round(mean(r2.age$pls.test), 3)
mod.summary[3, 3] <- round(mean(RMSE.age$pls.test), 3)
mod.summary[4, 2] <- round(mean(r2.age$vip.test), 3)
mod.summary[4, 3] <- round(mean(RMSE.age$vip.test), 3)
# sapply(mods.vip,plot)
# sapply(mods.pls,plot)
rm(r2.age, RMSE.age)
knitr::kable(mod.summary, align = "c")
model      r2     RMSE     AIC      AICc
lm       0.766  11.865  383.176  384.585
gam      0.753  11.779  370.186  380.351
pls      0.805  10.345       NA       NA
pls_vip  0.785  10.324       NA       NA

Shiny

# Define UI
ui <- fluidPage(
  titlePanel("Model Plotter"),
  # Radio button to select model list
  radioButtons(inputId = "model_list", 
               label = "Select model list:", 
               choices = c("MLR", "GAM", "GAM - Partial", "PLS", "PLS - VIP > 1")),
  # SelectInput to select model index
  selectInput(inputId = "model_index",
              label = "Select model:",
              choices = as.character(1:10)),
  # Output panel for plot
  mainPanel(
    plotOutput(outputId = "model_plot")
  )
)
# Define server logic
server <- function(input, output) {
  model_list <- reactive({
    if (input$model_list == "PLS") {
      mods.pls} 
    else if (input$model_list == "PLS - VIP > 1") {
      mods.vip} 
    else if (input$model_list == "MLR") {
      mods.lm} 
    else if (input$model_list == "GAM") {
      mods.GAM} 
    else if (input$model_list == "GAM - Partial") {
      mods.GAM}
  })
  output$model_plot <- renderPlot({
    # Get selected model  
    select_model <- model_list()
    # Plot the model using appropriate plot function
    if (input$model_list == "GAM") {
      appraise(select_model[[as.integer(input$model_index)]])} 
    else if (input$model_list == "GAM - Partial") {
    draw(select_model[[as.integer(input$model_index)]])} 
    else {switch(input$model_list,
         "PLS" = plot(select_model[[as.integer(input$model_index)]]),
         "PLS - VIP > 1" = plot(select_model[[as.integer(input$model_index)]]),
         "MLR" = gglm(select_model[[as.integer(input$model_index)]]))}
    })
}
shinyApp(ui = ui, server = server)
Shiny applications not supported in static R Markdown documents

Plots

Correlation

Show the high correlation among sequential wavenumbers; about 100 wavenumbers are shown here.

corrplot::corrplot(cor(scan_avg_filter[, 150:250]), tl.pos = "n", cl.cex = 2)

Spectra

Raw vs preprocessed and filtered

One specimen

20 specimens

All specimens colored by age

Model performance - MLR & GAM

Diagnostics

LM

# Using test/cal split # 5 for model diagnostics
# ggplot version of lm diagnostics
gglm(mods.lm[[5]], theme = theme_bw(), 
     theme(plot.title = element_text(size = 22), 
           axis.title = element_text(size = 20), 
           axis.text = element_text(size = 18))
     )

GAM

# Using test/cal split # 5 for model diagnostics
# GAM diagnostics
appraise(mods.GAM[[5]]) & 
  theme_bw() &
  theme(plot.title = element_text(size = 15),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)
  )

# GAM partial effects
draw(mods.GAM[[5]], residuals = T) & 
  theme_bw() &
  theme(plot.title = element_text(size = 15), 
        axis.title = element_text(size = 14), 
        axis.text = element_text(size = 12)
        )

Length vs sample date

Demonstrate fish got larger over time

ggplot(scan_avg_filter, aes(x = sample_date, y = length)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", col = "black") +
  theme_bw() + # theme_bw() must come before theme(), otherwise it overrides the text-size settings
  theme(
    axis.title = element_text(size = 20),
    axis.text = element_text(size = 15)
  ) +
  xlab("Sample Date") +
  ylab("Fork length (mm)")
## `geom_smooth()` using formula = 'y ~ x'

PCA - Outliers

Colored by length with outliers removed marked as triangles

age_only <- readRDS("RDS_dataframes/LPW_scan_avg_proc_filter.RDS")
age_only <- age_only[complete.cases(age_only$read_age), ]
pca_temp <- pca(age_only[, 21:ncol(age_only)])
age_only$PC1 <- pca_temp$res$cal$scores[,1]
age_only$PC2 <- pca_temp$res$cal$scores[,2]
rm(pca_temp)
rm_specimens <- c("24", "52", "53", "59", "65", "67", "74")
library(plotly)
ggplotly(
  ggplot() + 
    geom_point(data = age_only[!c(age_only$specimen %in% rm_specimens),], 
               aes(x = PC1, y = PC2, color = read_age), size = 5) + 
    geom_point(data = age_only[c(age_only$specimen %in% rm_specimens),], 
               aes(x = PC1, y = PC2, color = read_age), shape = 17, size = 7) +
    scale_color_viridis()
)
rm(age_only, rm_specimens)

Model Age - length + structure weight

# Create dataframe only with ages contained in metadata
age_only <- readRDS("RDS_dataframes/LPW_scan_avg_proc_filter.RDS")
age_only <- age_only[complete.cases(age_only$read_age), ]
age_only <- age_only[complete.cases(age_only$structure_weight), ]
rm_specimens <- c("24", "52", "53", "59", "65", "67", "74")
age_only <- age_only %>% filter(!specimen %in% rm_specimens)
rm(rm_specimens)
# pca_temp <- pca(scan_avg_filter[,21:ncol(scan_avg_filter),])

MLR

# store metrics from each fold and each model type
RMSE.age <- list()
r2.age <- list()
AIC.age <- list()
AICc.age <- list()
mod.lw.summary <- data.frame(
  model = c("lm", "gam"),
  r2 = 1:2,
  RMSE = 1:2,
  AIC = c(0,0),
  AICc = c(0,0))
set.seed(6)
splits <- caret::createFolds(age_only$read_age, k = 10, list = TRUE, returnTrain = FALSE)
mods.lw.lm <- list()
for (i in 1:10) {
  calibrate <- age_only[-splits[[i]],]
  testing <- age_only[splits[[i]],]
  mod <- lm(data = calibrate, read_age ~ length + structure_weight)
  preds <- predict(mod, newdata = testing)
  RMSE.age$lm.test[i] <- caret::RMSE(pred = preds, obs = testing[, "read_age"])
  RSS <- sum((testing$read_age - preds)^2)
  TSS <- sum((testing$read_age - mean(testing$read_age))^2)
  r2.age$lm.test[i] <- 1 - (RSS / TSS)
  AIC.age$lm[i] <- AIC(mod)
  AICc.age$lm[i] <- AICc(mod)
  mods.lw.lm[[i]] <- mod
  rm(mod, preds,RSS,TSS)
}
mod.lw.summary[1,2] <- round(mean(r2.age$lm.test),3)
mod.lw.summary[1,3] <- round(mean(RMSE.age$lm.test),3)
mod.lw.summary[1,4] <- round(mean(AIC.age$lm),3)
mod.lw.summary[1,5] <- round(mean(AICc.age$lm),3)
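The held-out r2 stored in the loop above is 1 − RSS/TSS, with TSS taken about the test-fold mean. Unlike calibration R², this quantity is not bounded below by 0: a model that predicts worse than the fold mean yields a negative value, which is worth keeping in mind when averaging across folds. A base-R illustration:

```r
# Out-of-sample R^2 as computed in the CV loops: 1 - RSS/TSS.
# It can go negative when predictions are worse than the test-fold mean.
oos_r2 <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}
obs <- c(1, 2, 3, 4, 5)
oos_r2(obs, obs)                 # perfect predictions -> 1
oos_r2(obs, rep(mean(obs), 5))   # predicting the mean -> 0
oos_r2(obs, c(5, 4, 3, 2, 1))    # worse than the mean -> -3
```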

GAM

mods.lw.GAM <- list()
for (i in 1:10) {
  calibrate <- age_only[-splits[[i]], ]
  testing <- age_only[splits[[i]], ]
  mod <- gam(data = calibrate, read_age ~ s(length) + s(structure_weight), method = "REML")
  # Extract AIC, AICc, RMSE (cal & test) & r2 (cal & test)
  # RMSE.age$GAM.cal[i] <- caret::RMSE(pred = mod$fitted.values, obs = calibrate[, "read_age"])
  preds <- predict(mod, newdata = testing)
  RMSE.age$GAM.test[i] <- caret::RMSE(pred = preds, obs = testing[, "read_age"])
  # r2.age$gam.cal[i] <- summary(mod)$r.sq
  RSS <- sum((testing$read_age - preds)^2)
  TSS <- sum((testing$read_age - mean(testing$read_age))^2)
  r2.age$gam.test[i] <- 1 - (RSS / TSS)
  AIC.age$gam[i] <- AIC(mod)
  AICc.age$gam[i] <- AICc(mod)
  mods.lw.GAM[[i]] <- mod
  rm(mod, preds, RSS,TSS)
}
mod.lw.summary[2,2] <- round(mean(r2.age$gam.test),3)
mod.lw.summary[2,3] <- round(mean(RMSE.age$GAM.test),3)
mod.lw.summary[2,4] <- round(mean(AIC.age$gam),3)
mod.lw.summary[2,5] <- round(mean(AICc.age$gam),3)
rm(AIC.age,AICc.age, RMSE.age,r2.age)
knitr::kable(mod.lw.summary, align = "c")
| model |  r2   |  RMSE  |   AIC   |  AICc   |
|:-----:|:-----:|:------:|:-------:|:-------:|
|  lm   | 0.756 | 11.082 | 355.694 | 356.672 |
|  gam  | 0.760 | 10.833 | 348.363 | 352.259 |
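MuMIn::AICc, used for the tables here, applies the standard small-sample correction AICc = AIC + 2k(k + 1)/(n − k − 1), where k counts all estimated parameters (for lm, the coefficients plus the residual variance). A quick base-R check of that relationship on a built-in dataset:

```r
# Small-sample correction behind AICc: AICc = AIC + 2k(k+1)/(n-k-1),
# with k = number of estimated parameters (coefficients + sigma for lm)
mod <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(mod)                       # 32
k <- length(coef(mod)) + 1           # 3 coefficients + residual variance = 4
aicc <- AIC(mod) + 2 * k * (k + 1) / (n - k - 1)
aicc - AIC(mod)                      # correction term 40/27, shrinks as n grows
```

With n large relative to k the correction is negligible, which is why the AIC and AICc columns above track each other closely.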

Model Comparison

Modeling with PCs
| model   |  r2   |  RMSE  |   AIC   |  AICc   |
|:-------:|:-----:|:------:|:-------:|:-------:|
| lm      | 0.766 | 11.865 | 383.176 | 384.585 |
| gam     | 0.753 | 11.779 | 370.186 | 380.351 |
| pls     | 0.805 | 10.345 |   NA    |   NA    |
| pls_vip | 0.785 | 10.324 |   NA    |   NA    |

Modeling with length and otolith structure weight

| model |  r2   |  RMSE  |   AIC   |  AICc   |
|:-----:|:-----:|:------:|:-------:|:-------:|
|  lm   | 0.756 | 11.082 | 355.694 | 356.672 |
|  gam  | 0.760 | 10.833 | 348.363 | 352.259 |

Model Structure Weight - PLS

PLS

Predict otolith structure weight from the FT-NIRS spectra

# Keep only specimens with both a read age and a structure weight in the metadata
age_only <- readRDS("RDS_dataframes/LPW_scan_avg_proc_filter.RDS")
age_only <- age_only[complete.cases(age_only$read_age), ]
age_only <- age_only[complete.cases(age_only$structure_weight), ]
rm_specimens <- c("24", "52", "53", "59", "65", "67", "74")
age_only <- age_only %>% filter(!specimen %in% rm_specimens)
rm(rm_specimens)
# pca_temp <- pca(scan_avg_filter[,21:ncol(scan_avg_filter),])
set.seed(6)
splits <- caret::createFolds(age_only$structure_weight, k = 10, list = TRUE, returnTrain = FALSE)
mods.sw.pls <- list()
mods.sw.vip <- list()
RMSE.weight <- list()
r2.weight <- list()
mod.sw.summary <- data.frame(
  model = c("pls", "pls_vip"),
  r2 = 1:2,
  RMSE = 1:2)
# test pls, no variable selection
for (i in 1:10) {
  calibrate <- age_only[-splits[[i]], ]
  testing <- age_only[splits[[i]], ]
  mod <- pls(calibrate[, 31:ncol(calibrate)], calibrate[, "structure_weight"],
             ncomp.selcrit = "wold", scale = FALSE, center = TRUE, cv = 1,
             info = "Otolith Weight Prediction Model - No VIP",
             x.test = testing[, 31:ncol(testing)], 
             y.test = testing[, "structure_weight"])
  ncomp <- mod$ncomp.selected
  # RMSE.age$pls.cal[i] <- mod$calres$rmse[[ncomp]]
  RMSE.weight$pls.test[i] <- mod$testres$rmse[[ncomp]]
 #  r2.age$pls.cal[i] <- mod$calres$r2[[ncomp]]
  r2.weight$pls.test[i] <- mod$testres$r2[[ncomp]]
  mods.sw.pls[[i]] <- mod
  
# Refit excluding predictors with VIP < 1
  vip <- as.data.frame(vipscores(mod))
  mod <- pls(calibrate[, 31:ncol(calibrate)], calibrate[, "structure_weight"],
                 ncomp.selcrit = "wold", scale = FALSE, center = TRUE, cv = 1,
                 info = "Otolith Weight Prediction Model - VIP",
                 x.test = testing[, 31:ncol(testing)],
                 y.test = testing[, "structure_weight"],
                 exclcols = vip[[1]] < 1 # first (only) column, regardless of its name
  )
  ncomp <- mod$ncomp.selected
  # RMSE.age$vip.cal[i] <-  mod$calres$rmse[[ncomp]]
  RMSE.weight$vip.test[i] <-  mod$testres$rmse[[ncomp]]
  # r2.age$vip.cal[i] <-  mod$calres$r2[[ncomp]]
  r2.weight$vip.test[i] <-  mod$testres$r2[[ncomp]]
  mods.sw.vip[[i]] <- mod
  rm(mod, ncomp, vip)
}
mod.sw.summary[1,2] <- round(mean(r2.weight$pls.test),3)
mod.sw.summary[1,3] <- round(mean(RMSE.weight$pls.test),7)
mod.sw.summary[2,2] <- round(mean(r2.weight$vip.test),3)
mod.sw.summary[2,3] <- round(mean(RMSE.weight$vip.test),7)
rm(r2.weight,RMSE.weight)
knitr::kable(mod.sw.summary, align = "c")
| model   |  r2   |   RMSE    |
|:-------:|:-----:|:---------:|
| pls     | 0.996 | 0.0003909 |
| pls_vip | 0.995 | 0.0004309 |
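The VIP > 1 exclusion rule above uses the standard VIP definition, VIP_j = sqrt(p · Σ_a SSY_a w_ja² / Σ_a SSY_a), where w_ja are the normalized PLS weights and SSY_a is the y-variance captured by component a. For reference, a self-contained base-R sketch that computes VIPs from a tiny NIPALS PLS1 fit; this mirrors, but is not, the mdatools::vipscores implementation used in the chunk above.

```r
# VIP scores from a minimal NIPALS PLS1 fit (illustrative sketch only)
vip_pls1 <- function(X, y, A = 2) {
  X <- scale(X, center = TRUE, scale = FALSE)
  y <- y - mean(y)
  p <- ncol(X)
  W   <- matrix(0, p, A)   # normalized X-weights per component
  ssy <- numeric(A)        # y-variance captured per component
  for (a in seq_len(A)) {
    w  <- crossprod(X, y)
    w  <- w / sqrt(sum(w^2))
    t  <- X %*% w
    tt <- sum(t^2)
    pl <- crossprod(X, t) / tt
    q  <- sum(y * t) / tt
    X  <- X - t %*% t(pl)  # deflate X
    y  <- y - q * t        # deflate y
    W[, a] <- w
    ssy[a] <- q^2 * tt
  }
  # VIP_j = sqrt(p * sum_a(ssy_a * w_ja^2) / sum_a(ssy_a))
  sqrt(p * drop(W^2 %*% ssy) / sum(ssy))
}

set.seed(1)
X <- matrix(rnorm(30 * 8), nrow = 30)  # 30 samples x 8 predictors
y <- 2 * X[, 1] + rnorm(30, sd = 0.1)  # only predictor 1 is informative
round(vip_pls1(X, y), 2)               # VIP for predictor 1 stands out
```

A useful property of this definition is that the VIPs average to 1 by construction, which is what makes VIP > 1 a natural "more important than average" cutoff.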